
Break document sections into list for export Python

Ask Time:2017-04-14T05:17:23         Author:SultanGromma


I am very new to Python, and I am trying to break some legal documents into sections for export into SQL. I need to do two things:

  1. Define the section numbers by the table of contents, and
  2. Break up the document given the defined section numbers

The table of contents lists section numbers: 1.1, 1.2, 1.3, etc.

Then the document itself is broken up by those section numbers: 1.1 "...Text...", 1.2 "...Text...", 1.3 "...Text...", etc.

Similar to the chapters of a book, but delimited by ascending decimal numbers.

I have the document parsed using Tika, and I've been able to create a list of sections with some basic regex:

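import re

from tika import parser

# Extract the plain text of the PDF with Tika
parsed = parser.from_file('test.pdf')
content = parsed["content"]

# Section numbers such as 1.1 or 12.10: at least one digit on each side of the dot
headers = re.findall(r"[0-9]+[.][0-9]+", content)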
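To illustrate the header-matching step, here is a minimal sketch of how a pattern anchored to the start of a line might separate real section headers from decimal numbers that merely appear in the body text. The sample string and the `find_headers` helper are made up for this example, and it assumes each section number begins its own line in the Tika output, which may not hold for every PDF.

```python
import re

# Hypothetical sample standing in for parsed["content"]
sample = "1.1 Definitions\nThe fee is 2.5 percent.\n1.2 Term\n1.10 Notices\n"

# Assumption: a header is a number like 1.1 at the very start of a line
HEADER_RE = re.compile(r"^[0-9]+[.][0-9]+\b", re.MULTILINE)

def find_headers(text):
    """Return (section_number, start_offset) pairs for each header-like line."""
    return [(m.group(), m.start()) for m in HEADER_RE.finditer(text)]

headers = find_headers(sample)
numbers = [number for number, _ in headers]
```

Keeping the start offset alongside each number makes it possible to slice the text between consecutive headers later, instead of searching for the numbers a second time.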

Now I need to do something like this:

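# Split the text at each section number (digits, a dot, digits); the first item is the text before the first header
splitsections = re.split(r"[0-9]+[.][0-9]+", content)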

var_string = ', '.join('?' * len(splitsections))
query_string = 'INSERT INTO table VALUES (%s);' % var_string
cursor.execute(query_string, splitsections)
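One way the split-and-insert step could look, sketched under several assumptions: an in-memory `sqlite3` database stands in for the real SQL database, the `sections` table and its two columns are hypothetical, the sample text stands in for the Tika output, and every section number starts its own line. It stores one row per section rather than one wide row with a column per section.

```python
import re
import sqlite3

# Hypothetical sample standing in for parsed["content"]
content = (
    "Preamble text\n"
    "1.1 Definitions apply throughout.\n"
    "1.2 The term is one year.\n"
    "1.3 Notices must be in writing.\n"
)

# Assumption: each section starts with its number at the beginning of a line
HEADER_RE = re.compile(r"^([0-9]+[.][0-9]+)\s+", re.MULTILINE)

def split_sections(text):
    """Return (section_number, body) pairs, dropping text before the first header."""
    # With one capturing group, re.split returns
    # [text_before, number1, body1, number2, body2, ...]
    parts = HEADER_RE.split(text)
    numbers, bodies = parts[1::2], parts[2::2]
    return [(number, body.strip()) for number, body in zip(numbers, bodies)]

sections = split_sections(content)

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("CREATE TABLE sections (number TEXT, body TEXT)")  # hypothetical schema
conn.executemany("INSERT INTO sections VALUES (?, ?)", sections)
conn.commit()

rows = conn.execute("SELECT number, body FROM sections ORDER BY rowid").fetchall()
```

If the table of contents repeats the same numbers at the start of its lines, those lines would be split out as well, so they would probably need to be skipped or filtered before inserting.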

Sorry if all this is unclear. Still very new to this.

Any help you can provide would be most appreciated.

Author:SultanGromma, reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/43401861/break-document-sections-into-list-for-export-python